
Add C++ runtime for Spleeter source separation #2242


Merged: 15 commits merged into k2-fsa:master from spleeter-cpp on May 23, 2025

Conversation

csukuangfj (Collaborator)

Usage

Build sherpa-onnx and download model files

git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake ..
make

# go to https://github.com/k2-fsa/sherpa-onnx/releases/tag/source-separation-models

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/sherpa-onnx-spleeter-2stems-fp16.tar.bz2

tar xvf sherpa-onnx-spleeter-2stems-fp16.tar.bz2

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/qi-feng-le-zh.wav
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/source-separation-models/audio_example.wav

Run it with audio_example.wav

./bin/sherpa-onnx-offline-source-separation \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --num-threads=2 \
  --debug=0 \
  --input-wav=audio_example.wav \
  --output-vocals-wav=output_vocals.wav \
  --output-accompaniment-wav=output_accompaniment.wav

The output logs are

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline-source-separation --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx --num-threads=2 --debug=0 --input-wav=audio_example.wav --output-vocals-wav=output_vocals.wav --output-accompaniment-wav=output_accompaniment.wav

OfflineSourceSeparationConfig(model=OfflineSourceSeparationModelConfig(spleeter=OfflineSourceSeparationSpleeterModelConfig(vocals="sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx", accompaniment="sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx"), num_threads=2, debug=False, provider="cpu"))
Started
Done
Saved to write to 'output_vocals.wav' and 'output_accompaniment.wav'
num threads: 2
Elapsed seconds: 0.469 s
Real time factor (RTF): 0.469 / 10.919 = 0.043

Note: the commands above were run on my macOS (x64) machine, CPU only.

audio_example.wav (attachment: audio_example.mov)

output_vocals.wav for audio_example.wav (attachment: output_vocals.mov)

output_accompaniment.wav for audio_example.wav (attachment: output_accompaniment.mov)

Run it with qi-feng-le-zh.wav

./bin/sherpa-onnx-offline-source-separation \
  --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx \
  --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx \
  --num-threads=2 \
  --debug=0 \
  --input-wav=./qi-feng-le-zh.wav \
  --output-vocals-wav=output_vocals.wav \
  --output-accompaniment-wav=output_accompaniment.wav

The output logs are

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:372 ./build/bin/sherpa-onnx-offline-source-separation --spleeter-vocals=sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx --spleeter-accompaniment=sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx --num-threads=2 --debug=0 --input-wav=./qi-feng-le-zh.wav --output-vocals-wav=output_vocals.wav --output-accompaniment-wav=output_accompaniment.wav

OfflineSourceSeparationConfig(model=OfflineSourceSeparationModelConfig(spleeter=OfflineSourceSeparationSpleeterModelConfig(vocals="sherpa-onnx-spleeter-2stems-fp16/vocals.fp16.onnx", accompaniment="sherpa-onnx-spleeter-2stems-fp16/accompaniment.fp16.onnx"), num_threads=2, debug=False, provider="cpu"))
Started
Done
Saved to write to 'output_vocals.wav' and 'output_accompaniment.wav'
num threads: 2
Elapsed seconds: 1.262 s
Real time factor (RTF): 1.262 / 26.102 = 0.048

qi-feng-le-zh.wav (attachment: qi-feng-le-zh.mov)

Vocals for qi-feng-le-zh.wav (attachment: output_vocals.mov)

Accompaniment for qi-feng-le-zh.wav (attachment: output_accompaniment.mov)

Fixes #2235

CC @acely

@csukuangfj csukuangfj merged commit 716ba83 into k2-fsa:master May 23, 2025
184 of 218 checks passed
@csukuangfj csukuangfj deleted the spleeter-cpp branch May 23, 2025 14:31
acely commented May 23, 2025

That was fast; hats off! That said, Spleeter's output quality is not ideal. It was originally developed to remove vocals and obtain the accompaniment, so the accompaniment output is of better quality than the vocals, and the vocals sound muddy. I'd still suggest trying the mdx-net model; its Voc_FT variant was trained specifically to extract vocals.

acely commented May 23, 2025

I extracted vocals from the two clips above with Voc_FT; compare the results for yourself.

1_main.mov
2_main.mov

csukuangfj (Collaborator, Author)

> That was fast; hats off! That said, Spleeter's output quality is not ideal. It was originally developed to remove vocals and obtain the accompaniment, so the accompaniment output is of better quality than the vocals, and the vocals sound muddy. I'd still suggest trying the mdx-net model; its Voc_FT variant was trained specifically to extract vocals.

How fast is that model on CPU? Could you measure its RTF?

acely commented May 24, 2025

My test machine is a Mac mini with an M4 Pro chip, loading the model through onnxruntime called from Java.
The test material was a 45-minute audio file:
- CPU only: 850 seconds, RTF = 0.3148
- With CoreML enabled: 350 seconds, RTF = 0.13

dfengpo (Contributor) commented May 25, 2025

> mdx-net

Those results are stunning. Where can I download the onnx version of the mdx-net model? I really need it. And one more question: can it also handle recordings made amid crowd noise, e.g. in a shopping mall?


acely commented May 25, 2025

@dfengpo As for noise reduction, I'm not sure; I don't have suitable test material at hand.

dfengpo (Contributor) commented May 29, 2025

@acely

Were your separation examples above produced with this script?
https://github.com/acely/uvr-mdx-infer/blob/main/separate.py
I saw the original author mention a simpler implementation:
https://github.com/seanghay/vocal/blob/main/vocal/__init__.py
I'd like to confirm which script you used to produce the results. Many thanks.

acely commented May 29, 2025

> @acely
>
> Were your separation examples above produced with this script? https://github.com/acely/uvr-mdx-infer/blob/main/separate.py I saw the original author mention a simpler implementation: https://github.com/seanghay/vocal/blob/main/vocal/__init__.py I'd like to confirm which script you used to produce the results. Many thanks.

I used this one: https://github.com/acely/uvr-mdx-infer/blob/main/separate.py


Successfully merging this pull request may close these issues:

Suggest adding a speech separation feature that can extract clean vocals from noise and background music, for better ASR (#2235)